Support Vector Machine Classification with Indefinite Kernels 
Abstract



We propose a method for support vector machine classification using indefinite kernels. Instead
of directly minimizing or stabilizing a nonconvex loss function, our algorithm simultaneously
computes support vectors and a proxy kernel matrix used in forming the loss. This can be
interpreted as a penalized kernel learning problem where indefinite kernel matrices are treated
as noisy observations of a true Mercer kernel. Our formulation keeps the problem convex and
relatively large problems can be solved efficiently using the projected gradient or analytic center
cutting plane methods. We compare the performance of our technique with other methods on
several standard data sets.

1 Introduction 

Support vector machines (SVM) have become a central tool for solving binary classification prob- 
lems. A critical step in support vector machine classification is choosing a suitable kernel matrix, 
which measures similarity between data points and must be positive semidefinite because it is 
formed as the Gram matrix of data points in a reproducing kernel Hilbert space. This positive 
semidefinite condition on kernel matrices is also known as Mercer's condition in the machine learn- 
ing literature. The classification problem then becomes a linearly constrained quadratic program. 
Here, we present an algorithm for SVM classification using indefinite kernels, i.e. kernel matrices
formed using similarity measures which are not positive semidefinite. 

Our interest in indefinite kernels is motivated by several observations. First, certain similar- 
ity measures take advantage of application-specific structure in the data and often display excel- 
lent empirical classification performance. Unlike popular kernels used in support vector machine 
classification, these similarity matrices are often indefinite, so do not necessarily correspond to a 
reproducing kernel Hilbert space. (See Ong et al. (2004) for a discussion.) 

In particular, an application of classification with indefinite kernels to image classification us- 
ing Earth Mover's Distance was discussed in Zamolotskikh and Cunningham (2004). Similarity 
measures for protein sequences such as the Smith-Waterman and BLAST scores are indefinite yet
have provided hints for constructing useful positive semidefinite kernels such as those described in
Saigo et al. (2004) or have been transformed into positive semidefinite kernels with good empirical 
performance (see Lanckriet et al. (2003), for example). Tangent distance similarity measures, as 
described in Simard et al. (1998) or Haasdonk and Keysers (2002), are invariant to various simple
image transformations and have also shown excellent performance in optical character recognition. 
Finally, it is sometimes impossible to prove that some kernels satisfy Mercer's condition or the 
numerical complexity of evaluating the exact positive kernel is too high and a proxy (and not 
necessarily positive semidefinite) kernel has to be used instead (see Cuturi (2007), for example). 
In both cases, our method allows us to bypass these limitations. Our objective here is to derive 
efficient algorithms to directly use these indefinite similarity measures for classification. 

Our work closely follows, in spirit, recent results on kernel learning (see Lanckriet et al. (2004) or 
Ong et al. (2005)), where the kernel matrix is learned as a linear combination of given kernels, and 
the result is explicitly constrained to be positive semidefinite. While this problem is numerically 
challenging, Bach et al. (2004) adapted the SMO algorithm to solve the case where the kernel 
is written as a positively weighted combination of other kernels. In our setting here, we never 
numerically optimize the kernel matrix because this part of the problem can be solved explicitly, 
which means that the complexity of our method is substantially lower than that of classical kernel 
learning algorithms and closer in practice to the algorithm used in Sonnenburg et al. (2006), who
formulate the multiple kernel learning problem of Bach et al. (2004) as a semi-infinite linear program 
and solve it with a column generation technique similar to the analytic center cutting plane method 
we use here. 

1.1 Current results 

Several methods have been proposed for dealing with indefinite kernels in SVMs. A first direction 
embeds data in a pseudo-Euclidean (pE) space: Haasdonk (2005), for example, formulates the 
classification problem with an indefinite kernel as that of minimizing the distance between convex 
hulls formed from the two categories of data embedded in the pE space. The nonseparable case is 
handled in the same manner using reduced convex hulls. (See Bennett and Bredensteiner (2000) for
a discussion on geometric interpretations in SVM.) 

Another direction applies direct spectral transformations to indefinite kernels: flipping the neg- 
ative eigenvalues or shifting the eigenvalues and reconstructing the kernel with the original eigen- 
vectors in order to produce a positive semidefinite kernel (see Wu et al. (2005) and Zamolotskikh 
and Cunningham (2004), for example). Yet another option is to reformulate either the maximum 
margin problem or its dual in order to use the indefinite kernel in a convex optimization problem. 
One reformulation suggested in Lin and Lin (2003) replaces the indefinite kernel by the identity 
matrix and maintains separation using linear constraints. This method achieves good performance, 
but the convexification procedure is hard to interpret. Directly solving the nonconvex problem 
sometimes gives good results as well (see Woznica et al. (2006) and Haasdonk (2005)) but offers no 
guarantees on performance. 

1.2 Contributions 

In this work, instead of directly transforming the indefinite kernel, we simultaneously learn the 
support vector weights and a proxy Mercer kernel matrix by penalizing the distance between this 
proxy kernel and the original, indefinite one. Our main result is that the kernel learning part of that 
problem can be solved explicitly, meaning that the classification problem with indefinite kernels 
can simply be formulated as a perturbation of the positive semidefinite case. 

Our formulation can be interpreted as a penalized kernel learning problem with uncertainty
on the input kernel matrix. In that sense, indefinite similarity matrices are seen as noisy observa- 
tions of a true positive semidefinite kernel and we learn a kernel that increases the generalization 
performance. From a complexity standpoint, while the original SVM classification problem with 
indefinite kernel is nonconvex, the penalization we detail here results in a convex problem, and 
hence can be solved efficiently with guaranteed complexity bounds. 

The paper is organized as follows. In Section 2 we formulate our main classification result and
detail its interpretation as a penalized kernel learning problem. In Section 3 we describe three
algorithms for solving this problem. Section 4 discusses several extensions of our main results.
Finally, in Section 5 we test the numerical performance of these methods on various data sets.

Notation 

We write $\mathbf{S}^n$ ($\mathbf{S}^n_+$) to denote the set of symmetric (positive-semidefinite)
matrices of size $n$. The vector $e$ is the $n$-vector of ones. Given a matrix $X$, $\lambda_i(X)$
denotes the $i$-th eigenvalue of $X$. $X_+$ is the positive part of the matrix $X$, i.e.
$X_+ = \sum_i \max(0, \lambda_i) v_i v_i^T$ where $\lambda_i$ and $v_i$ are the $i$-th eigenvalue
and eigenvector of the matrix $X$. Given a vector $x$, $\|x\|_1 = \sum_i |x_i|$.



2 SVM with indefinite kernels 

In this section, we modify the SVM kernel learning problem and formulate a penalized kernel 
learning problem on indefinite kernels. We also detail how our framework applies to kernels that 
satisfy Mercer's condition. 



2.1 Kernel learning 

Let $K \in \mathbf{S}^n$ be a given kernel matrix and let $y \in \mathbf{R}^n$ be the vector of
labels, with $Y = \mathrm{diag}(y)$, the matrix with diagonal $y$. We formulate the kernel learning
problem as in Lanckriet et al. (2004), where the authors minimize an upper bound on the
misclassification probability when using SVM with a given kernel $K$. This upper bound is the
generalized performance measure

$$\omega_C(K) = \max_{\{0 \le \alpha \le C,\ \alpha^T y = 0\}} \alpha^T e - \mathrm{Tr}(K (Y\alpha)(Y\alpha)^T)/2 \tag{1}$$

where $\alpha \in \mathbf{R}^n$ and $C$ is the SVM misclassification penalty. This is also the
classic 1-norm soft margin SVM problem. They show that $\omega_C(K)$ is convex in $K$ and solve
problems of the form

$$\min_{K \in \mathcal{K}} \omega_C(K) \tag{2}$$

in order to learn an optimal kernel $K^*$ that achieves good generalization performance. When
$\mathcal{K}$ is restricted to convex subsets of $\mathbf{S}^n_+$ with constant trace, they show
that problem (2) can be reformulated as a convex program. Further restrictions to $\mathcal{K}$
reduce (2) to more tractable optimization problems such as semidefinite and quadratically
constrained quadratic programs. Our goal is to solve a problem similar to (2) by restricting the
distance between a proxy kernel used in classification and the original indefinite similarity
measure.

2.2 Learning from indefinite kernels

The performance measure in (1) is the dual of the SVM classification problem with hinge loss and
quadratic penalty. When $K$ is positive semidefinite, this problem is a convex quadratic program.
Suppose now that we are given an indefinite kernel matrix $K_0 \in \mathbf{S}^n$. We formulate a
new instance of problem (2) by restricting $K$ to be a positive semidefinite kernel matrix in some
given neighborhood of the original (indefinite) kernel matrix $K_0$ and solve

$$\min_{\{K \succeq 0,\ \|K - K_0\|_F^2 \le \beta\}} \ \max_{\{\alpha^T y = 0,\ 0 \le \alpha \le C\}} \ \alpha^T e - \frac{1}{2} \mathrm{Tr}(K (Y\alpha)(Y\alpha)^T)$$

in the variables $K \in \mathbf{S}^n$ and $\alpha \in \mathbf{R}^n$, where the parameter
$\beta > 0$ controls the distance between the original matrix $K_0$ and the proxy kernel $K$. This
is the kernel learning problem (2) with
$\mathcal{K} = \{K \succeq 0,\ \|K - K_0\|_F^2 \le \beta\}$. The above problem is infeasible for
small values of $\beta$, so we replace here the hard constraint on $K$ by a penalty $\rho$ on the
distance between the proxy kernel and the original indefinite similarity matrix and solve instead

$$\min_{\{K \succeq 0\}} \ \max_{\{\alpha^T y = 0,\ 0 \le \alpha \le C\}} \ \alpha^T e - \frac{1}{2} \mathrm{Tr}(K (Y\alpha)(Y\alpha)^T) + \rho \|K - K_0\|_F^2 \tag{3}$$

Because (3) is convex-concave and the inner maximization has a compact feasible set, we can switch
the max and min to form the dual

$$\max_{\{\alpha^T y = 0,\ 0 \le \alpha \le C\}} \ \min_{\{K \succeq 0\}} \ \alpha^T e - \frac{1}{2} \mathrm{Tr}(K (Y\alpha)(Y\alpha)^T) + \rho \|K - K_0\|_F^2 \tag{4}$$

in the variables $K \in \mathbf{S}^n$ and $\alpha \in \mathbf{R}^n$.

We first note that problem (4) is a convex optimization problem. The inner minimization
problem is a convex conic program on $K$. Also, as the pointwise minimum of a family of concave
quadratic functions of $\alpha$, the solution to the inner problem is a concave function of
$\alpha$, hence the outer optimization problem is also convex (see Boyd and Vandenberghe (2004)
for further details). Thus, (4) is a concave maximization problem subject to linear constraints
and is therefore a convex problem in $\alpha$. Our key result here is that the inner kernel
learning optimization problem in (4) can be solved in closed form.

Theorem 1 Given a similarity matrix $K_0 \in \mathbf{S}^n$, a vector $\alpha \in \mathbf{R}^n$ of
support vector coefficients and the label matrix $Y = \mathrm{diag}(y)$, the optimal kernel in
problem (4) can be computed explicitly as:

$$K^* = (K_0 + (Y\alpha)(Y\alpha)^T/(4\rho))_+ \tag{5}$$

where $\rho > 0$ controls the penalty.

Proof. For a fixed $\alpha$, the inner minimization problem can be written out as

$$\min_{\{K \succeq 0\}} \ \alpha^T e + \rho\left(\mathrm{Tr}(K^T K) - 2\,\mathrm{Tr}\!\left(K^T \left(K_0 + \frac{1}{4\rho}(Y\alpha)(Y\alpha)^T\right)\right) + \mathrm{Tr}(K_0^T K_0)\right)$$

where we have replaced $\|K - K_0\|_F^2 = \mathrm{Tr}((K - K_0)^T (K - K_0))$ and collected similar
terms. Adding and subtracting the constant
$\rho\,\mathrm{Tr}\big((K_0 + \frac{1}{4\rho}(Y\alpha)(Y\alpha)^T)^T (K_0 + \frac{1}{4\rho}(Y\alpha)(Y\alpha)^T)\big)$
shows that the inner minimization problem is equivalent to the problem

$$\begin{array}{ll} \text{minimize} & \left\|K - \left(K_0 + \frac{1}{4\rho}(Y\alpha)(Y\alpha)^T\right)\right\|_F^2 \\ \text{subject to} & K \succeq 0 \end{array}$$

in the variable $K \in \mathbf{S}^n$, where we have dropped the remaining constants from the
objective. This is the projection of the matrix $K_0 + (Y\alpha)(Y\alpha)^T/(4\rho)$ on the cone of
positive semidefinite matrices, which yields the desired result. ■
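
As an illustration of Theorem 1, the following is a minimal NumPy sketch of the closed-form proxy kernel. The function names (`positive_part`, `proxy_kernel`, `inner_objective`) and the random test data are assumptions made for this example and are not taken from the authors' package.

```python
import numpy as np

def positive_part(X):
    """Project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues."""
    lam, V = np.linalg.eigh((X + X.T) / 2)    # symmetrize to guard against round-off
    return (V * np.maximum(lam, 0.0)) @ V.T

def proxy_kernel(K0, y, alpha, rho):
    """Closed form of Theorem 1: (K0 + (Y alpha)(Y alpha)^T / (4 rho))_+."""
    ya = y * alpha                             # Y alpha with Y = diag(y)
    return positive_part(K0 + np.outer(ya, ya) / (4.0 * rho))

def inner_objective(K, K0, y, alpha, rho):
    """Objective of the inner minimization in (4) for a fixed alpha."""
    ya = y * alpha
    return alpha.sum() - 0.5 * ya @ K @ ya + rho * np.linalg.norm(K - K0, "fro") ** 2

# Hypothetical data: a random symmetric (generally indefinite) similarity matrix.
rng = np.random.default_rng(0)
n = 6
B = rng.standard_normal((n, n))
K0 = (B + B.T) / 2
y = np.where(rng.standard_normal(n) > 0, 1.0, -1.0)
alpha = rng.uniform(0.0, 1.0, n)
rho = 0.5
K_star = proxy_kernel(K0, y, alpha, rho)
```

Evaluating `inner_objective` at `K_star` and at arbitrary positive semidefinite candidates gives a quick numerical check that the projection attains the smallest value.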

Plugging the explicit solution for the proxy kernel derived in (5) into the classification
problem (4), we get

$$\max_{\{\alpha^T y = 0,\ 0 \le \alpha \le C\}} \ \alpha^T e - \frac{1}{2}\mathrm{Tr}(K^* (Y\alpha)(Y\alpha)^T) + \rho \|K^* - K_0\|_F^2 \tag{6}$$

in the variable $\alpha \in \mathbf{R}^n$, where $(Y\alpha)(Y\alpha)^T$ is the rank one matrix with
coefficients $y_i \alpha_i \alpha_j y_j$. Problem (6) can be cast as an eigenvalue optimization
problem in the variable $\alpha$. Letting the eigenvalue decomposition of
$K_0 + (Y\alpha)(Y\alpha)^T/(4\rho)$ be $V D V^T$, we get $K^* = V D_+ V^T$, and with $v_i$ the
$i$-th column of $V$, we can write

$$\begin{aligned} \mathrm{Tr}(K^* (Y\alpha)(Y\alpha)^T) &= (Y\alpha)^T V D_+ V^T (Y\alpha) \\ &= \sum_i \max\left(0, \lambda_i\left(K_0 + \frac{1}{4\rho}(Y\alpha)(Y\alpha)^T\right)\right) (\alpha^T Y v_i)^2. \end{aligned}$$

Using the same technique, we can also rewrite the term $\|K^* - K_0\|_F^2$ using this eigenvalue
decomposition. Our original optimization problem (6) finally becomes

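$$\begin{array}{ll} \text{maximize} & \alpha^T e - \frac{1}{2}\sum_i \max(0, \lambda_i(K_0 + (Y\alpha)(Y\alpha)^T/(4\rho)))(\alpha^T Y v_i)^2 \\ & + \rho \sum_i \left(\max(0, \lambda_i(K_0 + (Y\alpha)(Y\alpha)^T/(4\rho)))\right)^2 \\ & - 2\rho \sum_i \mathrm{Tr}((v_i v_i^T) K_0) \max(0, \lambda_i(K_0 + (Y\alpha)(Y\alpha)^T/(4\rho))) + \rho\,\mathrm{Tr}(K_0 K_0) \\ \text{subject to} & \alpha^T y = 0,\ 0 \le \alpha \le C \end{array} \tag{7}$$

in the variable $\alpha \in \mathbf{R}^n$. By construction, the objective function is concave,
hence (7) is a convex optimization problem in $\alpha$.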
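
To make the eigenvalue form concrete, here is a small NumPy sketch that evaluates the objective twice: once directly from (6) using the projected kernel, and once term by term from the eigenvalue expression in (7). The function names and the random instance are assumptions for illustration only.

```python
import numpy as np

def objective_direct(K0, y, alpha, rho):
    """Objective (6): plug the projected kernel K* into the penalized SVM dual."""
    ya = y * alpha
    lam, V = np.linalg.eigh(K0 + np.outer(ya, ya) / (4.0 * rho))
    K_star = (V * np.maximum(lam, 0.0)) @ V.T
    return (alpha.sum() - 0.5 * ya @ K_star @ ya
            + rho * np.linalg.norm(K_star - K0, "fro") ** 2)

def objective_eig(K0, y, alpha, rho):
    """Objective (7): the same quantity written with eigenvalues and eigenvectors."""
    ya = y * alpha
    lam, V = np.linalg.eigh(K0 + np.outer(ya, ya) / (4.0 * rho))
    lam_plus = np.maximum(lam, 0.0)
    proj = V.T @ ya                               # entries alpha^T Y v_i
    quad = np.einsum("ji,jk,ki->i", V, K0, V)     # v_i^T K0 v_i = Tr((v_i v_i^T) K0)
    return (alpha.sum()
            - 0.5 * np.sum(lam_plus * proj ** 2)
            + rho * np.sum(lam_plus ** 2)
            - 2.0 * rho * np.sum(quad * lam_plus)
            + rho * np.trace(K0 @ K0))

# Hypothetical random instance.
rng = np.random.default_rng(0)
n = 7
B = rng.standard_normal((n, n))
K0 = (B + B.T) / 2
y = np.where(rng.standard_normal(n) > 0, 1.0, -1.0)
alpha = rng.uniform(0.0, 1.0, n)
val_direct = objective_direct(K0, y, alpha, rho=0.8)
val_eig = objective_eig(K0, y, alpha, rho=0.8)
```

The two values should agree up to round-off, since the three penalty terms in (7) are just the expansion of the squared Frobenius distance between $K^*$ and $K_0$.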

A reformulation of problem (4) appears in Chen and Ye (2008) where the authors move the
inner minimization problem to the constraints and get the following semi-infinite quadratically
constrained linear program (SIQCLP):

$$\begin{array}{ll} \text{maximize} & t \\ \text{subject to} & \alpha^T y = 0,\ 0 \le \alpha \le C \\ & t \le \alpha^T e - \frac{1}{2}\mathrm{Tr}(K (Y\alpha)(Y\alpha)^T) + \rho\|K - K_0\|_F^2 \quad \forall K \succeq 0. \end{array} \tag{8}$$

In Section 3 we describe algorithms to solve our eigenvalue optimization problem in (7), as well as
an algorithm from Chen and Ye (2008) that solves the different formulation in (8), for completeness.

2.3 Interpretation 

Our explicit solution of the optimal kernel given in (5) is the projection of a penalized rank-one
update to the indefinite kernel on the cone of positive semidefinite matrices. As $\rho$ tends to
infinity, the rank-one update has less effect and in the limit, the optimal kernel is given by
zeroing out the negative eigenvalues of the indefinite kernel. This means that if the indefinite
kernel contains a very small amount of noise, the best positive semidefinite kernel to use with SVM
in our framework is the positive part of the indefinite kernel.

This limit as $\rho$ tends to infinity also motivates a heuristic for transforming the kernel on
the testing set. Since negative eigenvalues in the training kernel are thresholded to zero in the
limit, the same transformation should occur for the test kernel. Hence, to measure generalization
performance, we update the entries of the full kernel corresponding to training instances by the
rank-one update resulting from the optimal solution to (7) and threshold the negative eigenvalues
of the full kernel matrix to zero to produce a Mercer kernel on the test set.
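
The test-set heuristic can be sketched in a few lines of NumPy. This is one possible reading of the description above, not the authors' code: the function name `transform_full_kernel`, the index layout, and the example data are all assumptions.

```python
import numpy as np

def transform_full_kernel(K_full, train_idx, y_train, alpha, rho):
    """Apply the rank-one update on the training block, then zero negative eigenvalues."""
    K = K_full.copy()
    ya = y_train * alpha                                   # Y alpha on the training points
    K[np.ix_(train_idx, train_idx)] += np.outer(ya, ya) / (4.0 * rho)
    lam, V = np.linalg.eigh((K + K.T) / 2)
    return (V * np.maximum(lam, 0.0)) @ V.T                # positive part of the full kernel

# Hypothetical example: 4 training points and 3 test points.
rng = np.random.default_rng(3)
B = rng.standard_normal((7, 7))
K_full = (B + B.T) / 2
train_idx = np.array([0, 1, 2, 3])
y_train = np.array([1.0, -1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.2, 0.0, 0.3])
K_mercer = transform_full_kernel(K_full, train_idx, y_train, alpha, rho=1.0)
```

The rows of `K_mercer` indexed by test points and the columns indexed by training points would then supply the kernel values needed to evaluate the classifier on the test set.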

2.4 Dual problem 

As discussed above, problems (3) and (4) are dual. The inner maximization in problem (3) is a
quadratic program in $\alpha$, whose dual is the quadratic minimization problem

$$\begin{array}{ll} \text{minimize} & \frac{1}{2}(e - \delta + \mu + y\nu)^T (YKY)^{-1} (e - \delta + \mu + y\nu) + C\mu^T e \\ \text{subject to} & \delta, \mu \ge 0. \end{array} \tag{9}$$

Substituting (9) for the inner maximization in problem (3) allows us to write a joint minimization
problem

$$\begin{array}{ll} \text{minimize} & \mathrm{Tr}\big(K^{-1}(Y^{-1}(e - \delta + \mu + y\nu))(Y^{-1}(e - \delta + \mu + y\nu))^T\big)/2 + C\mu^T e + \rho\|K - K_0\|_F^2 \\ \text{subject to} & K \succeq 0,\ \delta, \mu \ge 0 \end{array} \tag{10}$$

in the variables $K \in \mathbf{S}^n$, $\delta, \mu \in \mathbf{R}^n$ and $\nu \in \mathbf{R}$.
This is a quadratic program in the variables $\delta, \mu$ (which correspond to the constraints
$0 \le \alpha \le C$) and $\nu$ (which is the dual variable for the constraint $\alpha^T y = 0$).
As we have seen earlier, any feasible solution $\alpha \in \mathbf{R}^n$ produces a corresponding
proxy kernel in (5). Plugging this kernel into problem (10) allows us to compute an upper bound on
the optimum value of problem (4) by solving a simple quadratic program in the variables
$\delta, \mu, \nu$. This result can then be used to bound the duality gap in (7) and track
convergence.

3 Algorithms 

We now detail two algorithms that can be used to solve problem (7), which maximizes a
nondifferentiable concave function subject to convex constraints. An optimal point always exists
since the feasible set is bounded and nonempty. For numerical stability, in both algorithms, we
quadratically smooth our objective to compute a gradient. We first describe a simple projected
gradient method which has numerically cheap iterations but less predictable performance in
practice. We then show how to apply the analytic center cutting plane method, whose iterations are
numerically more complex but which converges linearly. For completeness, we also describe an
exchange method from Chen and Ye (2008) used to solve problem (8), where the numerical bottleneck
is a quadratically constrained linear program solved at each iteration.

Smoothing. Our objective contains terms of the form $\max\{0, f(x)\}$ for some function $f(x)$,
which are not differentiable (described in the section below). These functions are easily smoothed
out by a Moreau-Yosida regularization technique (see Hiriart-Urruty and Lemarechal (1993), for
example). We replace the max by a continuously differentiable $\epsilon$-approximation as follows:

$$\varphi_\epsilon(f(x)) = \max_{0 \le u \le 1} \left(u f(x) - \frac{\epsilon}{2} u^2\right).$$

The gradient is then given by $\nabla \varphi_\epsilon(f(x)) = u^*(x) \nabla f(x)$ where
$u^*(x) = \mathrm{argmax}\ \varphi_\epsilon(f(x))$.
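
The maximization over $u$ in the smoothed function has a simple closed form, since the unconstrained maximizer $f(x)/\epsilon$ only needs to be clipped to $[0, 1]$. A minimal sketch, with the function name `smoothed_max` chosen here for illustration:

```python
import numpy as np

def smoothed_max(f, eps):
    """Return (phi_eps(f), u_star) for phi_eps(f) = max_{0<=u<=1} (u*f - eps*u^2/2)."""
    u_star = np.clip(f / eps, 0.0, 1.0)        # clipped unconstrained maximizer f/eps
    return u_star * f - 0.5 * eps * u_star ** 2, u_star

eps = 0.1
phi_neg, u_neg = smoothed_max(-1.0, eps)       # f <= 0: phi = 0, u* = 0
phi_mid, u_mid = smoothed_max(0.05, eps)       # 0 < f < eps: phi = f^2 / (2 eps)
phi_big, u_big = smoothed_max(2.0, eps)        # f >= eps: phi = f - eps / 2
```

The approximation error satisfies $0 \le \max\{0, f\} - \varphi_\epsilon(f) \le \epsilon/2$, and `u_star` is the multiplier applied to $\nabla f(x)$ in the gradient formula.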



Gradient. Calculating the gradient of the objective function in (7) requires computing the
eigenvalue decomposition of a matrix of the form $X(\alpha) = K + \rho\alpha\alpha^T$. Given a
matrix $X(\alpha)$, the derivative of the $i$-th eigenvalue with respect to $\alpha$ is then given
by

$$\frac{\partial \lambda_i(X(\alpha))}{\partial \alpha} = v_i^T \frac{\partial X(\alpha)}{\partial \alpha} v_i \tag{11}$$

where $v_i$ is the $i$-th eigenvector of $X(\alpha)$. We can then combine this expression with the
smooth approximation above to obtain the gradient.
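
For the rank-one structure $X(a) = K + \rho a a^T$, the eigenvalue derivative reduces to $2\rho (v_i^T a) v_i$ when the eigenvalue is simple. The sketch below computes all these gradients at once; the function name and the random data are assumptions for this example, and simple (non-repeated) eigenvalues are assumed throughout.

```python
import numpy as np

def eigenvalue_gradients(K, a, rho):
    """Eigenvalues of K + rho*a*a^T and their gradients with respect to a.

    Row i of G is the gradient of the i-th eigenvalue (ascending order),
    valid when that eigenvalue is simple.
    """
    lam, V = np.linalg.eigh(K + rho * np.outer(a, a))
    # d lambda_i / d a_k = v_i^T (rho (e_k a^T + a e_k^T)) v_i = 2 rho (v_i^T a) v_i[k]
    G = 2.0 * rho * (V.T @ a)[:, None] * V.T
    return lam, G

# Hypothetical random instance.
rng = np.random.default_rng(2)
n = 5
B = rng.standard_normal((n, n))
K = (B + B.T) / 2
a = rng.standard_normal(n)
rho = 0.3
lam, G = eigenvalue_gradients(K, a, rho)
```

In the setting of (7), the same formula applies with $a = Y\alpha$ and the coefficient $1/(4\rho)$ in place of $\rho$, followed by the chain rule through $Y$.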



3.1 Computing proxy kernels 

Because the proxy kernel in (5) only requires a rank-one update of a (fixed) eigenvalue
decomposition

$$K^* = (K_0 + (Y\alpha)(Y\alpha)^T/(4\rho))_+,$$

we now briefly recall how $v_i$ and $\lambda_i(X(\alpha))$ can be computed efficiently in this case
(see Demmel (1997) for further details). We refer the reader to Kulis et al. (2006) for another
kernel learning example using this method. Given the eigenvalue decomposition $X = VDV^T$, by
changing basis this problem can be reduced to the decomposition of the diagonal plus rank-one
matrix, $D + \rho uu^T$, where $u = V^T\alpha$. First, the updated eigenvalues are determined by
solving the secular equations

$$\det(D + \rho uu^T - \lambda I) = 0,$$

which can be done in $O(n^2)$. While there is an explicit solution for the eigenvectors
corresponding to these eigenvalues, they are not stable because the eigenvalues are approximated.
This instability is circumvented by computing a vector $\hat{u}$ such that the approximate
eigenvalues $\hat{\lambda}$ are the exact eigenvalues of the matrix $D + \rho\hat{u}\hat{u}^T$,
then computing its stable eigenvectors explicitly, where both steps can be done in $O(n^2)$ time.
The key is that $D + \rho\hat{u}\hat{u}^T$ is close enough to our original matrix so that the
eigenvalues and eigenvectors are stable approximations of the true values. Finally, the
eigenvectors of our original matrix are computed as $VW$, with $W$ as the stable eigenvectors of
$D + \rho\hat{u}\hat{u}^T$. Updating the eigenvalue decomposition is reduced to an $O(n^2)$
procedure plus one matrix multiplication, which is then the complexity of one gradient computation.
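
As a rough illustration of the eigenvalue step only, the sketch below solves the secular equation $1 + \rho \sum_i u_i^2/(d_i - \lambda) = 0$ by plain bisection on each interlacing interval. This is a simplified stand-in, not the stable procedure described in Demmel (1997); it assumes $\rho > 0$, strictly increasing diagonal entries, and nonzero entries of $u$. The function name is hypothetical.

```python
import numpy as np

def secular_eigenvalues(d, u, rho, iters=200):
    """Eigenvalues of diag(d) + rho*u*u^T (rho > 0, d strictly increasing, u_i != 0)."""
    n = len(d)
    # Root i lies in (d[i], d[i+1]); the largest lies in (d[n-1], d[n-1] + rho*||u||^2].
    upper = np.append(d[1:], d[-1] + rho * (u @ u))

    def f(lam):
        return 1.0 + rho * np.sum(u ** 2 / (d - lam))

    roots = np.empty(n)
    for i in range(n):
        lo, hi = d[i], upper[i]
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid == lo or mid == hi:        # interval exhausted in floating point
                break
            if f(mid) < 0:                    # f is increasing between consecutive poles
                lo = mid
            else:
                hi = mid
        roots[i] = 0.5 * (lo + hi)
    return roots

# Hypothetical random instance.
rng = np.random.default_rng(4)
d = np.sort(rng.standard_normal(6))
u = rng.standard_normal(6)
rho = 0.7
lam_secular = secular_eigenvalues(d, u, rho)
lam_dense = np.linalg.eigvalsh(np.diag(d) + rho * np.outer(u, u))
```

Each function evaluation costs $O(n)$ and there are $n$ roots, so the eigenvalue step is $O(n^2)$ up to the number of bisection iterations, in line with the complexity quoted above.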

We note that eigenvalues of symmetric matrices are not differentiable when some of them have
multiplicities greater than one (see Overton (1992) for a discussion), but a subgradient can be used
instead of the gradient in all the algorithms detailed here. Lewis (1999) shows how to compute an
approximate subdifferential of the $k$-th largest eigenvalue of a symmetric matrix. This can then be
used to form a regular subgradient of the objective function in (7) which is concave by construction.
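
Algorithm 1 Projected gradient method
1: Compute $\alpha_{i+1} = \alpha_i + t \nabla f(\alpha_i)$.

2: Set $\alpha_{i+1}$ to the projection of $\alpha_{i+1}$ onto the feasible set
$\mathcal{A} = \{\alpha^T y = 0,\ 0 \le \alpha \le C\}$.

3: If gap $\le \epsilon$ stop, otherwise go back to step 1.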



3.2 Projected gradient method 

The projected gradient method takes a steepest descent step, then projects the new point back
onto the feasible region (see Bertsekas (1999), for example). We choose an initial point
$\alpha_0 \in \mathbf{R}^n$ and the algorithm proceeds as in Algorithm 1.

Here, we have assumed that the objective function is differentiable (after smoothing). The
method is only efficient if the projection step is numerically cheap. The complexity of each iteration
then breaks down as follows:

Step 1. This requires an eigenvalue decomposition that is computed in $O(n^2)$ plus one matrix
multiplication as described above. Experiments below use a stepsize of $5/k$ for IndefiniteSVM and
$10/k$ for PerturbSVM (described in Section 4.3) where $k$ is the iteration number. A good stepsize
is crucial to performance, and must be chosen separately for each data set as there is no rule of
thumb. We note that a line search would be costly here because it would require multiple eigenvalue
decompositions to recalculate the objective multiple times.

Step 2. This is a projection onto the region
$\mathcal{A} = \{\alpha^T y = 0,\ 0 \le \alpha \le C\}$ and can be solved explicitly by sorting the
vector of entries, with cost $O(n \log n)$.

Stopping Criterion. We can compute a duality gap using the results of Section 2.4, where

$$K_i = (K_0 + (Y\alpha_i)(Y\alpha_i)^T/(4\rho))_+$$

is the candidate kernel at iteration $i$ and we solve problem (1), which simply means solving an SVM
problem with the positive semidefinite kernel $K_i$, and produces an upper bound on (7), hence a
bound on the suboptimality of the current solution.

Complexity. The number of iterations required by this method to reach a target precision of
$\epsilon$ grows as $O(1/\epsilon^2)$. See Nesterov (2003) for a complete discussion.
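
The two steps of the projected gradient iteration can be sketched as follows. Note the substitution: the projection here uses bisection on the multiplier of the equality constraint rather than the sorting-based method mentioned in Step 2, and the objective is a toy concave function standing in for (7). Function names, the stepsize constant and the data are assumptions, and labels are assumed to be $\pm 1$.

```python
import numpy as np

def project_feasible(a, y, C, iters=100):
    """Euclidean projection of a onto {alpha : alpha^T y = 0, 0 <= alpha <= C}.

    The projection has the form clip(a - nu*y, 0, C); nu is found by bisection.
    """
    lo = -(np.max(np.abs(a)) + C)
    hi = -lo
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        # y^T clip(a - nu*y, 0, C) is nonincreasing in nu when each y_i is +1 or -1
        if y @ np.clip(a - nu * y, 0.0, C) > 0:
            lo = nu
        else:
            hi = nu
    return np.clip(a - 0.5 * (lo + hi) * y, 0.0, C)

def projected_gradient_ascent(grad, alpha0, y, C, t0, n_iter):
    """Iterate alpha <- P(alpha + (t0/k) * grad(alpha)) with a t0/k stepsize."""
    alpha = project_feasible(alpha0, y, C)
    for k in range(1, n_iter + 1):
        alpha = project_feasible(alpha + (t0 / k) * grad(alpha), y, C)
    return alpha

# Toy concave objective f(alpha) = -||alpha - b||^2 / 2, a stand-in for (7).
rng = np.random.default_rng(1)
n, C = 8, 1.0
y = np.array([1.0, -1.0] * (n // 2))
b = rng.uniform(-0.5, 1.5, n)
alpha_hat = projected_gradient_ascent(lambda a: b - a, np.zeros(n), y, C, t0=1.0, n_iter=50)
```

For this toy objective the maximizer over the feasible set is simply the projection of `b`, which gives an easy correctness check; for (7), `grad` would instead combine the smoothed maximum with the eigenvalue derivatives.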

3.3 Analytic center cutting plane method 

The analytic center cutting plane method (ACCPM) reduces the feasible region at each iteration
using a new cut computed by evaluating a subgradient of the objective function at the analytic
center of the current feasible set, until the volume of the reduced region converges to the target
precision. This method does not require differentiability. We set
$\mathcal{C}_0 = \{x \in \mathbf{R}^n \mid x^T y = 0,\ 0 \le x \le C\}$, which we can write as
$\{x \in \mathbf{R}^n \mid A_0 x \le b_0\}$, to be our first localization set for the optimal
solution. The method is described in Algorithm 2 (see Bertsekas (1999) for a more complete
treatment of cutting plane methods).

The complexity of each iteration breaks down as follows:

Step 1. This step computes the analytic center of a polyhedron and can be solved in $O(n^3)$
operations using interior point methods, for example.

Step 2. This simply updates the polyhedral description. It includes the gradient computation which
again is $O(n^2)$ plus one matrix multiplication.

Algorithm 2 Analytic center cutting plane method
1: Compute $x_{i+1}$ as the analytic center of $\mathcal{C}_i$ by solving

$$x_{i+1} = \mathop{\mathrm{argmin}}_{x \in \mathbf{R}^n} \ -\sum_{j=1}^{m} \log(b_j - a_j^T x)$$

where $a_j^T$ represents the $j$-th row of coefficients from the left-hand side of
$\{x \in \mathbf{R}^n \mid A_i x \le b_i\}$.
2: Compute $\nabla f(x)$ at the center $x_{i+1}$ and update the (polyhedral) localization set

$$\mathcal{C}_{i+1} = \mathcal{C}_i \cap \{x \mid \nabla f(x_{i+1})^T (x - x_{i+1}) \ge 0\}$$

where $f$ is the objective in problem (7).
3: If $m \ge 3n$, reduce the number of constraints to $3n$.
4: If gap $\le \epsilon$ stop, otherwise go back to step 1.



Step 3. This step requires ordering the constraints according to their relevance in the localization
set. One relevance measure for the $j$-th constraint at iteration $i$ is

$$\frac{a_j^T \nabla^2 f(x_i)^{-1} a_j}{(a_j^T x_i - b_j)^2} \tag{12}$$

where $f$ is the objective function of the analytic center problem. Computing the Hessian is easy:
it requires matrix multiplication of the form $A^T D A$ where $A$ is $m \times n$ (matrix
multiplication is kept inexpensive in this step by pruning redundant constraints) and $D$ is
diagonal. Restricting the number of constraints to $3n$ is a rule of thumb; raising this limit
increases the per-iteration complexity while decreasing it increases the required number of
iterations.

Stopping Criterion. An upper bound is computed by maximizing a first order Taylor approximation
of $f(\alpha)$ at $\alpha_i$ over all points in an ellipsoid that covers the localization set
$\mathcal{C}_i$, which can be computed explicitly.

Complexity. ACCPM is provably convergent in $O(n(\log 1/\epsilon)^2)$ iterations when using cut
elimination, which keeps the complexity of the localization set bounded. Other schemes are
available with slightly different complexities: a bound of $O(n^2/\epsilon^2)$ is achieved in
Goffin and Vial (2002) using (cheaper) approximate centers, for example.
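
Steps 1 and 2 of the cutting plane iteration can be sketched with a damped Newton method for the analytic center and a helper that appends a cut. This is a simplified illustration under stated assumptions: the polyhedron is given purely by inequalities $Ax \le b$ with a strictly feasible starting point (the equality constraint $x^T y = 0$ is ignored here), and the function names and the box example are hypothetical.

```python
import numpy as np

def analytic_center(A, b, x0, iters=50, tol=1e-12):
    """Minimize -sum_j log(b_j - a_j^T x) by damped Newton from a strictly feasible x0."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        s = b - A @ x                                 # slacks, positive inside the polyhedron
        grad = A.T @ (1.0 / s)
        hess = A.T @ ((1.0 / s ** 2)[:, None] * A)    # A^T D A with D diagonal
        dx = -np.linalg.solve(hess, grad)
        if -(grad @ dx) < tol:                        # squared Newton decrement
            break
        t = 1.0
        while np.any(b - A @ (x + t * dx) <= 0):      # stay strictly feasible
            t *= 0.5
        f0 = -np.sum(np.log(s))
        while -np.sum(np.log(b - A @ (x + t * dx))) > f0 + 0.25 * t * (grad @ dx):
            t *= 0.5                                  # backtracking line search
        x = x + t * dx
    return x

def add_cut(A, b, g, x_center):
    """Append the cut g^T (x - x_center) >= 0, written as -g^T x <= -g^T x_center."""
    return np.vstack([A, -g]), np.append(b, -(g @ x_center))

# Unit box 0 <= x <= 1 in two dimensions; its analytic center is (0.5, 0.5).
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
center = analytic_center(A, b, np.array([0.2, 0.7]))
A_cut, b_cut = add_cut(A, b, np.array([1.0, 0.0]), center)
```

The Hessian line is the $A^T D A$ product mentioned in Step 3, and the same matrix could be reused to rank constraints by the relevance measure (12) when pruning.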

3.4 Exchange method for SIQCLP 

The algorithm considered in Chen and Ye (2008) in order to solve problem (8) falls under a class of
algorithms called exchange methods (as defined in Hettich and Kortanek (1993)). These methods
iteratively solve problems constrained by a finite subset of the infinitely many constraints, where
the solution at each iterate gives an improved lower bound to the maximization problem. The
subproblem solved at each iteration here is

$$\begin{array}{ll} \text{maximize} & t \\ \text{subject to} & \alpha^T y = 0,\ 0 \le \alpha \le C \\ & t \le \alpha^T e - \frac{1}{2}\mathrm{Tr}(K_i (Y\alpha)(Y\alpha)^T) + \rho\|K_i - K_0\|_F^2, \quad i = 1, \ldots, p \end{array} \tag{13}$$

where $p$ is the number of constraints used to approximate the infinitely many constraints of
problem (8). Let $(t_1, \alpha_1)$ be an initial solution found by solving problem (13) with $p = 1$
and $K_1 = (K_0)_+$, where $K_0$ is the input indefinite kernel. The algorithm proceeds as in
Algorithm 3 below.

Algorithm 3 Exchange method
1: Compute $K_{i+1}$ by solving the inner minimization problem of (3) as a function of $\alpha_i$.
2: Stop if

$$\alpha_i^T e - \frac{1}{2}\mathrm{Tr}(K_{i+1} (Y\alpha_i)(Y\alpha_i)^T) + \rho\|K_{i+1} - K_0\|_F^2 \ge t_i.$$

3: Solve problem (13) with an additional constraint using $K_{i+1}$ to get
$(t_{i+1}, \alpha_{i+1})$ and go back to step 1.



The complexity of each iteration breaks down as follows:

Step 1. This step can be solved analytically using Theorem 1. An efficient calculation of
$K_{i+1}$ can be made as in the other algorithms above using an $O(n^2)$ procedure plus one matrix
multiplication.

Step 2 (Stopping Criterion). The previous point $(t_i, \alpha_i)$ is optimal if it is feasible with
respect to the new constraint, in which case it is feasible for the infinitely many constraints of
the original problem (8) and hence also optimal.

Step 3. This step requires solving a QCLP with a number of quadratic constraints equal to the
number of iterations. As shown in Chen and Ye (2008), the QCLP can be written as a regularized
version of the multiple kernel learning (MKL) problem from Lanckriet et al. (2004), where the
number of constraints here is equal to the number of kernels in MKL. Efficient methods to
solve MKL with many kernels are an active area of research, most recently in Rakotomamonjy et al.
(2008). There, the authors use a gradient method to solve a reformulation of problem (13) as a
smooth maximization problem. Each objective value and gradient computation requires computing
a support vector machine, hence each iteration requires several SVM computations which can be
sped up using warm-starting. Furthermore, Chen and Ye (2008) prune inactive constraints at
each iteration in order to decrease the number of constraints in the QCLP.

Complexity. No rate of convergence is known for this algorithm, but the duality gap given in Chen
and Ye (2008) is shown to monotonically decrease.

3.5 Matlab Implementation 

The first two algorithms discussed here were implemented in Matlab for the cases of indefinite
(IndefiniteSVM) and positive semidefinite (PerturbSVM) kernels and can be downloaded from the
authors' webpages in a package called IndefiniteSVM. The $\rho$ penalty parameter is one-dimensional
in the implementation. This package makes use of the LIBSVM code of Chang and Lin (2001) to
produce suboptimality bounds and track convergence. A Matlab implementation of the exchange
method (due to the authors of Chen and Ye (2008)) that uses MOSEK (MOSEK ApS 2008) to
solve problem (13) is compared against the projected gradient method in Section 5.

4 Extensions 

In this section, we extend our results to other kernel methods, namely support vector regression
and one-class support vector machines. In addition, we apply our method to Mercer kernels
and show how to use more general penalties in our formulation.



4.1 SVR with indefinite kernels 

The practicality of indefinite kernels in SVM classification similarly motivates using indefinite
kernels in support vector regression (SVR). We here extend the formulations in Section 2 to SVR
with linear $\epsilon$-insensitive loss

$$\omega_C(K) = \max_{\{\alpha^T e = 0,\ -C \le \alpha \le C\}} \alpha^T y - \epsilon|\alpha| - \mathrm{Tr}(K\alpha\alpha^T)/2 \tag{14}$$

where $\alpha \in \mathbf{R}^n$ and $C$ is the SVR penalty parameter. The indefinite SVR
formulation follows directly as in Section 2.2 and the optimal kernel is learned by solving

$$\max_{\{\alpha^T e = 0,\ -C \le \alpha \le C\}} \ \min_{\{K \succeq 0\}} \ \alpha^T y - \epsilon|\alpha| - \frac{1}{2}\mathrm{Tr}(K\alpha\alpha^T) + \rho\|K - K_0\|_F^2 \tag{15}$$

in the variables $K \in \mathbf{S}^n$ and $\alpha \in \mathbf{R}^n$, where the parameter
$\rho > 0$ controls the magnitude of the penalty on the distance between $K$ and $K_0$. The
following corollary to Theorem 1 provides the solution to the inner minimization problem in (15).

Corollary 2 Given a similarity matrix $K_0 \in \mathbf{S}^n$ and a vector
$\alpha \in \mathbf{R}^n$ of support vector coefficients, the optimal kernel in problem (15) can be
computed explicitly as

$$K^* = (K_0 + \alpha\alpha^T/(4\rho))_+ \tag{16}$$

where $\rho > 0$ controls the penalty.

The proof follows directly as in Theorem 1; the slight difference is that the vector of labels $y$
does not appear in the optimal kernel. Plugging (16) into (15), the resulting formulation can be
rewritten as the convex eigenvalue optimization problem

$$\begin{array}{ll} \text{maximize} & \alpha^T y - \epsilon|\alpha| - \frac{1}{2}\sum_i \max(0, \lambda_i(K_0 + \alpha\alpha^T/(4\rho)))(\alpha^T v_i)^2 \\ & + \rho\sum_i \left(\max(0, \lambda_i(K_0 + \alpha\alpha^T/(4\rho)))\right)^2 \\ & - 2\rho\sum_i \mathrm{Tr}((v_i v_i^T)K_0)\max(0, \lambda_i(K_0 + \alpha\alpha^T/(4\rho))) + \rho\,\mathrm{Tr}(K_0 K_0) \\ \text{subject to} & \alpha^T e = 0,\ -C \le \alpha \le C \end{array} \tag{17}$$

in the variable $\alpha \in \mathbf{R}^n$. Again, a proxy kernel given by (16) can be produced from
any feasible solution $\alpha \in \mathbf{R}^n$. Plugging the proxy kernel into problem (15) allows
us to compute an upper bound on the optimum value of problem (15) by solving a support vector
regression problem.

4.2 One-class SVM with indefinite kernels 

The same reformulation can also be applied to one-class support vector machines which have the
formulation (see Scholkopf and Smola (2002))

$$\omega_\nu(K) = \max_{\{0 \le \alpha \le \frac{1}{\nu l},\ \alpha^T e = 1\}} -\mathrm{Tr}(K\alpha\alpha^T)/2 \tag{18}$$

where $\alpha \in \mathbf{R}^n$, $\nu$ is the one-class SVM parameter, and $l$ is the number of
training points. The indefinite one-class SVM formulation follows again as done for binary SVM and
SVR; the optimal kernel is learned by solving

$$\max_{\{\alpha^T e = 1,\ 0 \le \alpha \le \frac{1}{\nu l}\}} \ \min_{\{K \succeq 0\}} \ -\frac{1}{2}\mathrm{Tr}(K\alpha\alpha^T) + \rho\|K - K_0\|_F^2 \tag{19}$$

in the variables $K \in \mathbf{S}^n$ and $\alpha \in \mathbf{R}^n$. The inner minimization
problem is identical to that of indefinite SVR and the optimal kernel has the same form as given in
Corollary 2. Plugging (16) into (19) gives another convex eigenvalue optimization problem

$$\begin{array}{ll} \text{maximize} & -\frac{1}{2}\sum_i \max(0, \lambda_i(K_0 + \alpha\alpha^T/(4\rho)))(\alpha^T v_i)^2 \\ & + \rho\sum_i \left(\max(0, \lambda_i(K_0 + \alpha\alpha^T/(4\rho)))\right)^2 \\ & - 2\rho\sum_i \mathrm{Tr}((v_i v_i^T)K_0)\max(0, \lambda_i(K_0 + \alpha\alpha^T/(4\rho))) + \rho\,\mathrm{Tr}(K_0 K_0) \\ \text{subject to} & \alpha^T e = 1,\ 0 \le \alpha \le \frac{1}{\nu l} \end{array} \tag{20}$$

in the variable $\alpha \in \mathbf{R}^n$, which is identical to (17) without the first two terms
in the objective and with slightly different constraints. The algorithm is almost identical to the
one described above for the indefinite SVR formulation.



4.3 Learning from Mercer kernels 

While our central motivation is to use indefinite kernels for SVM classification, one would also
like to analyze what happens when a Mercer kernel is used as input to the penalized kernel learning
problem. In this case, we
learn another kernel that decreases the upper bound on generalization performance and produces
perturbed support vectors. We can again interpret the input as a noisy kernel, and as such, one
that will achieve suboptimal performance. If the input kernel is the best kernel to use (i.e. is not
noisy), we will observe that our framework achieves optimal performance as $\rho$ tends to infinity
(through cross validation); otherwise we simply learn a better kernel using a finite $\rho$.

When the similarity measure $K_0$ is positive semidefinite, the proxy kernel $K^*$ in Theorem 1
simplifies to a rank-one update of $K_0$

$$
K^* = K_0 + (Y\alpha^*)(Y\alpha^*)^T/(4\rho) \tag{21}
$$

whereas, for indefinite $K_0$, the solution was to project this matrix on the cone of positive semidefinite
matrices. Plugging (21) into the penalized kernel learning problem gives:

$$
\max_{\{\alpha^T y = 0,\ 0 \leq \alpha \leq C\}} \ \alpha^T e - \frac{1}{2}\mathbf{Tr}(K_0 (Y\alpha)(Y\alpha)^T) - \frac{1}{16\rho}\sum_{i,j}(\alpha_i\alpha_j)^2, \tag{22}
$$

which is the classic SVM problem given in (1) with a fourth order penalty on the support vectors.
For testing in this framework, we do not need to transform the kernel; only the support vectors
are perturbed. In this case, computing the gradient no longer requires eigenvalue decompositions
at each iteration. Experimental results are shown in Section 5.
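
The reduction to a fourth order penalty can be checked numerically. The sketch below, with hypothetical function names, assumes labels $y_i \in \{-1, +1\}$ and a symmetric $K_0$. It evaluates the penalized objective $\alpha^T e - \frac{1}{2}\mathbf{Tr}(K(Y\alpha)(Y\alpha)^T) + \rho\|K - K_0\|_F^2$ at the rank-one update of $K_0$, compares it with the fourth order form, and computes the gradient of the latter without any eigenvalue decomposition.

```python
import numpy as np

def penalized_objective(alpha, y, K, K0, rho):
    """alpha^T e - Tr(K (Y alpha)(Y alpha)^T)/2 + rho ||K - K0||_F^2."""
    u = y * alpha                                  # Y alpha
    return alpha.sum() - 0.5 * u @ K @ u + rho * np.sum((K - K0) ** 2)

def perturb_svm_objective(alpha, y, K0, rho):
    """Classic SVM dual objective with the fourth order penalty."""
    u = y * alpha
    penalty = np.sum(np.outer(alpha, alpha) ** 2) / (16.0 * rho)
    return alpha.sum() - 0.5 * u @ K0 @ u - penalty

def perturb_svm_gradient(alpha, y, K0, rho):
    """Gradient of perturb_svm_objective; only matrix-vector products are needed."""
    u = y * alpha
    return 1.0 - y * (K0 @ u) - alpha * (alpha @ alpha) / (4.0 * rho)
```

Since $\sum_{i,j}(\alpha_i\alpha_j)^2 = (\alpha^T\alpha)^2$, the penalty and its gradient cost only $O(n)$ operations beyond the usual SVM matrix-vector product.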

4.4 Componentwise penalties

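Indefinite SVM can be generalized further with componentwise penalties on the distance between
the proxy kernel and the indefinite kernel $K_0$. We generalize the penalized kernel learning problem to

$$
\max_{\{\alpha^T y = 0,\ 0 \leq \alpha \leq C\}} \ \min_{\{K \succeq 0\}} \ \alpha^T e - \frac{1}{2}\mathbf{Tr}(K(Y\alpha)(Y\alpha)^T) + \sum_{i,j} H_{ij}(K_{ij} - K_{0ij})^2 \tag{23}
$$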

where H is now a matrix of varying penalties on the componentwise distances. For a specific class 
of penalties, the optimal kernel K* can be derived explicitly as follows. 

Theorem 3 Given a similarity matrix $K_0 \in \mathbf{S}^n$, a vector $\alpha \in \mathbf{R}^n$ of support vector coefficients and
the label matrix $Y = \mathbf{diag}(y)$, when $H$ is rank-one with $H_{ij} = h_i h_j$, the optimal kernel in problem
(23) has the explicit form

$$
K^* = W^{-1/2}\left(W^{1/2}\left(K_0 + \tfrac{1}{4}(W^{-1}Y\alpha^*)(W^{-1}Y\alpha^*)^T\right)W^{1/2}\right)_+ W^{-1/2} \tag{24}
$$
where $W$ is the diagonal matrix with $W_{ii} = h_i$.

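Proof. The inner minimization problem in problem (23) can be written out as

$$
\min_{\{K \succeq 0\}} \ \sum_{i,j} H_{ij}(K_{ij} - K_{0ij})^2 - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j K_{ij}.
$$

Adding and subtracting $\sum_{i,j} H_{ij}(K_{0ij} + \frac{1}{4H_{ij}} y_i y_j \alpha_i \alpha_j)^2$, combining similar terms, and removing
remaining constants gives

$$
\begin{array}{ll}
\mbox{minimize} & \|H^{1/2} \circ (K - (K_0 + \frac{1}{4H} \circ (Y\alpha)(Y\alpha)^T))\|_F^2 \\
\mbox{subject to} & K \succeq 0
\end{array}
$$

where $\circ$ denotes the Hadamard product, $(A \circ B)_{ij} = a_{ij}b_{ij}$, $(H^{1/2})_{ij} = H_{ij}^{1/2}$, and
$(\frac{1}{4H})_{ij} = \frac{1}{4H_{ij}}$.

This is a weighted projection problem where $H_{ij}$ is the penalty on $(K_{ij} - K_{0ij})^2$. Since $H$ is
rank-one, the result follows from Theorem 3.2 of Higham (2002). ■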

Notice that Theorem 3 is a generalization of Theorem 1 where we had $H = ee^T$. In constructing
a rank-one penalty matrix $H$, we simply assign penalties to each training point. The componentwise
penalty formulation can also be extended to true kernels. If $K_0 \succeq 0$, then $K^*$ in Theorem 3 simplifies
to a rank-one update of $K_0$:

$$
K^* = K_0 + \tfrac{1}{4}(W^{-1}Y\alpha^*)(W^{-1}Y\alpha^*)^T \tag{25}
$$
where no projection is required.
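
The weighted projection of Theorem 3 is straightforward to sketch in code. The example below is an illustration under stated assumptions rather than the authors' implementation. The vector `h` holds the per-point penalties $h_i > 0$ (so that $H = hh^T$ and $W = \mathbf{diag}(h)$), and `psd_part` is a hypothetical helper that zeroes out negative eigenvalues.

```python
import numpy as np

def psd_part(A):
    """Projection of a symmetric matrix on the positive semidefinite cone."""
    lam, V = np.linalg.eigh((A + A.T) / 2.0)
    return (V * np.maximum(lam, 0.0)) @ V.T

def componentwise_proxy_kernel(K0, alpha, y, h):
    """Proxy kernel for a rank-one penalty matrix H = h h^T with h > 0."""
    w_sqrt = np.sqrt(h)                      # diagonal of W^{1/2}
    u = (y * alpha) / h                      # W^{-1} Y alpha
    A = K0 + 0.25 * np.outer(u, u)           # matrix to be projected
    scaled = w_sqrt[:, None] * A * w_sqrt[None, :]          # W^{1/2} A W^{1/2}
    return psd_part(scaled) / w_sqrt[:, None] / w_sqrt[None, :]
```

With `h` equal to the vector of ones the function reduces to an unweighted projection, and with a positive semidefinite `K0` the projection leaves the matrix unchanged, so no eigenvalues are clipped.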



5 Experiments 

In this section we compare the generalization performance of our technique to other methods apply- 
ing SVM classification to indefinite similarity measures. We also examine classification performance 
using Mercer kernels. We conclude with experiments showing convergence of our algorithms. All 
experiments on Mercer kernels use the LIBSVM library. 

5.1 Generalization with indefinite kernels

We compare our method for SVM classification with indefinite kernels to several kernel preprocess- 
ing techniques discussed earlier. The first three techniques perform spectral transformations on the 
indefinite kernel. The first, called denoise here, thresholds the negative eigenvalues to zero. The 
second transformation, called flip, takes the absolute value of all eigenvalues. The last transforma- 
tion, shift, adds a constant to each eigenvalue, making them all positive. See Wu et al. (2005) for 
further details. We also implemented an SVM modification (denoted Mod SVM) suggested in Lin 
and Lin (2003) where a nonconvex quadratic objective function is made convex by replacing the 
indefinite kernel with the identity matrix. The kernel only appears in linear inequality constraints 
that separate the data. Finally, we compare our results with a direct use of SVM classification on 
the original indefinite kernel (SVM converges but the solution is only a stationary point and not 
guaranteed to be optimal). 
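
The three spectral transformations can be sketched as follows. This is a minimal illustration, not the code used in the experiments. The function name is hypothetical, and for shift we assume the added constant is $|\lambda_{\min}|$, the smallest value that makes the spectrum nonnegative.

```python
import numpy as np

def spectral_transform(K, mode):
    """Apply denoise, flip or shift to the spectrum of a symmetric similarity matrix."""
    lam, V = np.linalg.eigh((K + K.T) / 2.0)
    if mode == "denoise":
        lam = np.maximum(lam, 0.0)           # threshold negative eigenvalues to zero
    elif mode == "flip":
        lam = np.abs(lam)                    # absolute value of all eigenvalues
    elif mode == "shift":
        lam = lam - min(lam.min(), 0.0)      # add |lambda_min| when it is negative
    else:
        raise ValueError("mode must be 'denoise', 'flip' or 'shift'")
    return (V * lam) @ V.T                   # rebuild the matrix from the new spectrum
```

All three outputs are positive semidefinite and share the eigenvectors of the input; only shift leaves the off-diagonal similarities unchanged.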

We first experiment on data from the USPS handwritten digits database (Hull 1994) using
the indefinite Simpson score and the one-sided tangent distance kernel to compare two digits.
The tangent distance is a transformation invariant measure — it assigns high similarity between an 
image and slightly rotated or shifted instances — and is known to perform very well on this data 
set. Our experiments symmetrize the one-sided tangent distance using the square of the mean 
tangent distance defined in Haasdonk and Keysers (2002) and make it a similarity measure by 
negative exponentiation. We also consider the Simpson score for this task, which is much cheaper 
to compute (a ratio comparing binary pixels). We finally analyze three data sets (diabetes, german
and a1a) from the UCI repository (Asuncion and Newman 2007) using the indefinite sigmoid kernel.

The data is randomly divided into training and testing data. We apply 5-fold cross validation 
and use an average of the accuracy and recall measures (described below) to determine the optimal 
parameters C, p, and any kernel inputs. We then train a model with the full training set and 
optimal parameters and test on the independent test set. 

Table 1 provides summary statistics for these data sets, including the minimum and maximum
eigenvalues of the training similarity matrices. We observe that the Simpson kernels are highly indefinite,
while the one-sided tangent distance kernel is nearly positive semidefinite. The spectrum of sigmoid
kernels varies greatly across examples because it is very sensitive to the sigmoid kernel parameters.
Table 2 compares accuracy, recall, and their average for denoise, flip, shift, modified SVM, direct
SVM and the indefinite SVM algorithm described in this work.
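
The sensitivity of the sigmoid kernel to its parameters is easy to observe numerically. The sketch below assumes the common parametrization $\tanh(a\,x_i^T x_j + b)$. The parameter names `a` and `b` and both function names are ours, chosen for illustration.

```python
import numpy as np

def sigmoid_kernel(X, a, b):
    """Similarity matrix with entries tanh(a * <x_i, x_j> + b)."""
    return np.tanh(a * (X @ X.T) + b)

def extreme_eigenvalues(K):
    """Smallest and largest eigenvalues of a symmetric similarity matrix."""
    lam = np.linalg.eigvalsh((K + K.T) / 2.0)
    return lam[0], lam[-1]

# Example: scan a few (a, b) pairs on synthetic data and report the spectrum range.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
for a, b in [(0.01, 0.0), (0.1, -1.0), (-0.1, 0.0)]:
    lam_min, lam_max = extreme_eigenvalues(sigmoid_kernel(X, a, b))
    print(f"a={a:+.2f}, b={b:+.2f}: lambda_min={lam_min:.2f}, lambda_max={lam_max:.2f}")
```

For instance, whenever $a < 0$ and $b \leq 0$ every diagonal entry is negative for nonzero data points, so the matrix cannot be positive semidefinite.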

Based on the interpretation from Section 2.3, Indefinite SVM should be expected to perform at
least as well as denoise; if denoise were a good transformation, then cross-validation over $\rho$ should
choose a high penalty that makes Indefinite SVM and denoise nearly equivalent. The rank-one
update provides more flexibility for the transformation, and similarities concerning data points
that are easily classified ($\alpha_j = 0$) are not modified by the rank-one update. Further interpretation
for the specific rank-one update is not currently known. However, Chen et al. (2009) recently
proposed spectrum modifications in a similar manner to Indefinite SVM. Rather than perturb the
entire indefinite similarity matrix, they perturb the spectrum directly, allowing improvements over
the denoise as well as flip transformations. They also note that Indefinite SVM might perform better
on sparse kernels because the rank-one update may then allow inference of hidden relationships.

We observe that Indefinite SVM performs comparably on all USPS examples (slightly better 
for the highly indefinite Simpson kernels), which are relatively easy classification problems. As 
expected, classification using the tangent distance outperforms classification with the Simpson 
score but, as mentioned above, the Simpson score is cheaper to compute. We also note that other 

Data Set        # Train   # Test    λmin       λmax
USPS-3-5-SS       767       773     -70.00     903.94
USPS-3-5-TD1      767       773      -0.31     764.72
USPS-4-6-SS       829       857     -74.38     819.36
USPS-4-6-TD1      829       857      -0.72     771.07
diabetes-sig      384       384      -0.65     211.62
german-sig        500       500    -928.10       8.50
a1a-sig           803       802      -0.01      84.44

Table 1: Summary statistics for the various data sets used in our experiments. The USPS
data comes from the USPS handwritten digits database, the other data sets are taken
from the UCI repository. SS refers to the Simpson kernel, TD1 to the one-sided tangent
distance kernel, and sig to the sigmoid kernel. Training and testing sets were divided
randomly. Notice that the Simpson kernels are mostly highly indefinite while the one-sided
tangent distance kernel is nearly positive semidefinite. The sigmoid kernel is highly indefinite
depending on the parametrization. Statistics for sigmoid kernels refer to the optimal kernel
parameterized under cross validation with Indefinite SVM. Spectra are based on the full
kernel, i.e. combining training and testing data.



documented classification results on this USPS data set perform multi-class classification, while here
we only perform binary classification. Classification of the UCI data sets with sigmoid kernels is
more difficult (as demonstrated by lower performance measures). Indefinite SVM here is the only
technique that achieves the best result in at least one of the measures on all three data sets.

5.2 Generalization with Mercer kernels 

Using this time linear and gaussian (both positive semidefinite, i.e. Mercer) kernels on the USPS
data set, we now compare classification performance using regular SVM and the penalized kernel
learning problem (22) of Section 4.3, which we call PerturbSVM here. We also test these
two techniques on positive semidefinite kernels formed using noisy USPS data sets (created by
adding uniformly distributed noise in [-1,1] to each pixel before normalizing to [0,1]), in which case
PerturbSVM can be seen as optimally denoised support vector machine classification. We again
cross-validate on a training set and test on the same independent group of examples used in the
experiments above. Optimal parameters from classification of unperturbed data were used to train
classifiers for perturbed data. Results are summarized in Table 3.

These results show that PerturbSVM performs at least as well in almost all cases. As expected, 
noise decreased generalization performance in all experiments. Except in the USPS-4-6-gaussian 
example, the value of $\rho$ selected was not the highest possible for each test where PerturbSVM
outperforms SVM in at least one measure; this implies that the support vectors were perturbed to 
improve classification. Overall, when zero or moderate noise is present, PerturbSVM does improve 
performance over regular SVM as shown. When too much noise is present however (for example, 
pixel data with range in [-1,1] was modified with uniform noise in [-2,2] before being normalized to 
[0,1]), the performance of both techniques is comparable. 

Data Set        Measure    Denoise   Flip     Shift    Mod SVM   SVM      Indefinite SVM
USPS-3-5-SS     Accuracy   95.47     95.21    93.27    96.12     69.47    95.73
                Recall     94.50     94.50    94.98    96.17     67.94    97.13
                Average    94.98     94.86    94.12    96.15     68.71    96.43
USPS-3-5-TD1    Accuracy   98.58     98.45    98.58    98.19     98.58    98.45
                Recall     98.56     98.33    98.56    97.85     98.56    98.33
                Average    98.57     98.39    98.57    98.02     98.57    98.39
USPS-4-6-SS     Accuracy   98.60     98.25    96.73    98.60     84.36    98.25
                Recall     99.32     99.32    96.61    99.32     81.72    99.77
                Average    98.96     98.79    96.67    98.96     83.04    99.01
USPS-4-6-TD1    Accuracy   99.30     99.30    99.18    99.18     99.30    99.30
                Recall     99.77     99.77    99.55    99.55     99.77    99.77
                Average    99.54     99.54    99.37    99.37     99.54    99.54
diabetes-sig    Accuracy   74.48     74.74    76.56    76.04     73.70    77.08
                Recall     78.40     76.80    89.60    78.40     76.40    89.20
                Average    76.44     75.77    83.08    77.22     75.05    83.14
german-sig      Accuracy   70.40     70.40    75.60    72.60     69.40    62.80
                Recall     78.00     78.00    46.67    66.00     80.00    85.33
                Average    74.20     74.20    61.13    69.30     74.70    74.07
a1a-sig         Accuracy   74.06     76.18    75.69    78.55     75.69    82.92
                Recall     87.31     87.82    87.31    89.34     87.82    81.73
                Average    80.69     82.00    81.50    83.95     81.75    82.32


Table 2: Indefinite SVM performs favorably for the highly indefinite Simpson kernels.
Performance is comparable for the nearly positive semidefinite one-sided tangent distance kernel.
Comparable performance with sigmoid kernels is more consistent with indefinite SVM across
data sets. The performance measures are: Accuracy = (TP+TN)/(TP+TN+FP+FN), Recall = TP/(TP+FN),
and Average = (Accuracy + Recall)/2.
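
For reference, the three performance measures can be computed from confusion-matrix counts as in the short sketch below, which uses the standard definitions of accuracy and recall and reports them as percentages. The function name is hypothetical.

```python
def accuracy_recall_average(tp, tn, fp, fn):
    """Return (accuracy, recall, average) in percent from confusion-matrix counts."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
    recall = 100.0 * tp / (tp + fn)                      # fraction of positives recovered
    return accuracy, recall, (accuracy + recall) / 2.0
```

Averaging accuracy and recall guards against parameter choices that reach high accuracy by neglecting the positive class.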



5.3 Convergence 

We ran our two algorithms on data sets created by randomly perturbing the four USPS data sets
used above. Average results and standard deviation are displayed in Figure 1 in semilog scale (note
that the codes were not stopped here and that the target duality gap improvement is usually much
smaller than $10^{-8}$). As expected, ACCPM converges much faster (in fact linearly) to a higher
precision, while each iteration requires solving a linear program of size n. The gradient projection
method converges faster in the beginning but stalls at higher precision; however, each iteration only
requires a rank-one update on an eigenvalue decomposition.

We finally examine the computing time of IndefiniteSVM using the projected gradient method 

                                 Unperturbed             Noisy
Data Set            Measure    SVM     Perturb SVM    SVM     Perturb SVM
USPS-3-5-linear     Accuracy   96.25   96.12          90.27   93.16
                    Recall     95.69   95.93          90.00   92.87
                    Average    95.97   96.03          90.14   93.01
USPS-4-6-linear     Accuracy   99.07   99.07          97.39   97.97
                    Recall     99.10   99.32          97.34   98.13
                    Average    99.08   99.19          97.36   98.05
USPS-3-5-gaussian   Accuracy   97.67   97.54          92.11   93.57
                    Recall     98.09   97.37          91.27   92.89
                    Average    97.88   97.46          91.69   93.23
USPS-4-6-gaussian   Accuracy   99.18   99.30          98.00   97.99
                    Recall     99.55   99.55          98.15   98.19
                    Average    99.37   99.42          98.08   98.09


Table 3: Performance measures for USPS data using linear and gaussian kernels. Unper- 
turbed refers to classification of the original data and Noisy refers to classification of data 
that is perturbed by uniform noise. Perturb SVM perturbs the support vectors to improve 
generalization. However, performance is lower for both techniques in the presence of high 
noise. 



[Figure 1 panels: ACCPM (left), Projected Gradient Method (right); horizontal axes: Iteration]


Figure 1: Convergence plots for ACCPM (left) and projected gradient method (right) on 
random subsets of the USPS-SS-3-5 data set (average gap versus iteration number, dashed 
lines at plus and minus one standard deviation). ACCPM converges linearly to a higher 
precision while the gradient projection method converges faster in the beginning but stalls 
at a higher precision. 

and ACCPM and compare them with the SIQCLP method of Chen and Ye (2008). Figure 2 shows
total runtime (left) and average iteration runtime (right) for varying problem dimensions on an
example from the USPS data with Simpson kernel. Experiments are averaged over 10 random data
subsets and we fix C = 10 with a tolerance of 0.1 for the duality gap. For the projected gradient
method, increasing $\rho$ increases the number of iterations to converge; notice that the average time
per iteration does not vary over $\rho$. SIQCLP also requires more iterations to converge for higher
$\rho$; however, the average iteration time seems to be lower for higher $\rho$, so no clear pattern is seen
when varying $\rho$. Note that the number of iterations required varies widely (between 100 and 2000
iterations in this experiment) as a function of $\rho$, C, the chosen kernel and the stepsize.

Results for ACCPM and SIQCLP are shown only up to dimensions 500 and 300, respectively,
because this sufficiently demonstrates that the projected gradient method is more efficient.
ACCPM clearly suffers from the complexity of the analytic center problem at each iteration. However,
improvements can be made in the SIQCLP implementation, such as using a regularized version of an
efficient MKL solver (e.g. Rakotomamonjy et al. (2008)) to solve problem (13) rather than MOSEK.
SIQCLP is also useful because it makes a connection between the indefinite SVM formulation and
multiple kernel learning. We observed from experiments that the duality gap found from SIQCLP
is tighter than the upper bound on the duality gap used for the projected gradient method. This
could potentially be used to create a better stopping condition; however, the complexity of deriving
the tighter duality gap (solving regularized MKL) is much higher than that of computing our current
gap (solving a single SVM).

Figure 2: Total time versus dimension (left) and average time per iteration versus
dimension (right) using projected gradient and ACCPM IndefiniteSVM and SIQCLP (only
for total time). The number of iterations for convergence varies from 100 for the smallest
dimension to 2000 for the largest dimension in this example which uses a Simpson kernel
on the USPS 3-5 data.



6 Conclusion 

We have proposed a technique for support vector machine classification with indefinite kernels, 
using a proxy kernel which can be computed explicitly. We also show how this framework can be
used to improve generalization performance with potentially noisy Mercer kernels, and how it extends
to other kernel methods such as support vector regression and one-class support vector machines.
We give two provably convergent algorithms for solving this problem on relatively large data sets. 
Our initial experiments show that our method fares quite favorably compared to other techniques 
handling indefinite kernels in the SVM framework and, in the limit, provides a clear interpretation 
for some of these heuristics. 